Abstract

Lateral gene transfer (LGT) is a common mechanism of non-vertical 
evolution where genetic material is transferred between two more or less 
distantly related organisms. It is particularly common in bacteria where it 
contributes to adaptive evolution with important medical implications. In
evolutionary studies, LGT has been shown to create widespread discordance 
between gene trees as genomes become mosaics of gene histories. In partic- 
ular, the Tree of Life has been questioned as an appropriate representation 
of bacterial evolutionary history. Nevertheless a common hypothesis is that 
prokaryotic evolution is primarily tree-like, but that the underlying trend is 
obscured by LGT. Extensive empirical work has sought to extract a common 
tree-like signal from conflicting gene trees. Here we give a probabilistic per- 
spective on the problem of recovering the tree-like trend despite LGT. Under 
a model of randomly distributed LGT, we show that the species phylogeny 
can be reconstructed even in the presence of surprisingly many (almost lin- 
ear number of) LGT events per gene tree. Our results, which are optimal up 
to logarithmic factors, are based on the analysis of a robust, computation-
ally efficient reconstruction method and provide insight into the design of
such methods. Finally we show that our results have implications for the 
discovery of highways of gene sharing. 

1 Introduction 

High-throughput sequencing is transforming the study of evolution by allowing 
the integration of genome analysis and systematic studies, an area called phyloge- 
nomics [EF03, DBP05]. An important step in most phylogenomic analyses is 
the reconstruction of a tree of ancestor-descendant relationships — a gene tree — 
for each family of orthologous genes in a dataset. Such analyses have revealed 
widespread discordance between gene trees [GD08], leading some to question 
the meaningfulness of the Tree of Life [GDL02, ZLG04, GT05, BSL+05, DB07, 
Koo07]. In addition to statistical errors in gene tree estimation, various mecha-
nisms commonly lead to incongruences between inferred gene histories, including
hybridization events, duplications and losses in gene families, incomplete lineage 
sorting, and lateral genetic transfers [Mad97]. 

Here we study specifically lateral gene transfer (LGT), that is, the non-vertical
transfer of genes between more or less distantly related organisms (as opposed to 
the standard vertical transmission between parent and offspring). Estimates of the 
fraction of genes that have undergone LGT vary widely — with some as high as 
99%. See e.g. [DM06, GD08] and references therein. LGT is particularly com- 
mon in bacterial evolution and it has been recognized to play an important role 
in microbial adaptation, selection and evolution with implications in the study 
of infectious diseases [SB05]. As a result, the bacterial phylogeny is usually in- 
ferred from genes that are thought to be immune to LGT, typically ribosomal RNA 
genes. However there is growing evidence that even such genes have in fact expe- 
rienced LGT [YZW99, vBTP+03, SSJ03, DSS+05]. In any case, LGT appears to 
be a major source of conflict between gene trees that must be taken into account 
appropriately in phylogenomic analyses, in particular when building phylogenies. 
This is the problem we address in this paper. 

Despite the confounding effect of LGT, we operate under the prevailing as- 
sumption that the evolution of organisms is governed primarily by vertical inheri- 
tance. In particular we ask: 

1. How much genetic transfer can be handled before the tree-like signal is 
completely erased? 
2. What phylogenetic reconstruction methods are most effective under this hy-
pothesis?

These questions, and other related issues, have been the subject of some em-
pirical and simulation-based work [BHR05, GWK05, Gal07, PWK09, PWK10,
KPW11]. See also [GD08, RB09] for enlightening discussions. In particular there
is ample evidence that a strong tree-like signal can be extracted in the presence of 
extensive LGT (although some debate remains on this question [GDL02]). 

In this paper we provide the first (to our knowledge) mathematical analysis 
of the issues above. We work under a stochastic model of gene tree topologies 
positing that LGT events occur at more or less random locations on the species 
phylogeny [Gal07]. In our main result we establish quantitative bounds implying 
that surprisingly high levels of LGT — almost linear in the number of branches for 
each gene — can be handled by simple, computationally efficient inference proce- 
dures. That amount of genetic transfer appears to be much higher than known 
empirical estimates of LGT frequency based on genomic datasets in prokaryotes.^1
Hence our results indicate that an accurate, reliable bacterial phylogeny should be 
reconstructible if the vertical inheritance hypothesis is correct. We prove that our 
bound on the achievable rate of LGT is tight up to logarithmic factors. We also 
show that constraining LGT to closely related species makes the tree reconstruc- 
tion problem significantly easier. 

Our theoretical approach complements simulation-based studies in allowing a 
broad range of parameters and tree shapes to be considered. Moreover our anal- 
ysis provides new insights into the design of effective reconstruction methods in 
the presence of LGT. More precisely we focus on methodologies, both distance-
based [KS01] and quartet-based [ZGC+06], that derive their statistical power
from the aggregation of basic topological information across genes. 

In addition, we study the effect of so-called highways of gene sharing, roughly, 
preferred genetic exchanges between specific groups of species. Beiko et al. [BHR05] 
provided empirical evidence for the existence of such highways. To identify high-
ways, they inferred LGT events by reconciling gene trees with a trusted species
tree. In subsequent work, Bansal et al. [BBGS11] formalized the problem and de-
signed a fast highway detection algorithm that aggregates conflicting signal across
genes rather than solving the difficult LGT inference problem on each gene tree. 
Similarly to Beiko et al., Bansal et al. rely on a trusted species tree. 

^1 Note that such estimates are typically based on small numbers of genomes and, therefore, are
probably lower than reality [GD08].
Here we show that a species phylogeny can be reliably estimated in the pres-
ence of both random LGT events and highways of LGT as long as such highways
involve a small enough fraction of genes. Under extra assumptions, we also de- 
sign an algorithm for inferring the location of highways. Because we first recover 
the species phylogeny, our highway reconstruction algorithm does not require a 
trusted species tree. In essence, our results on highways indicate that robust phy- 
logeny reconstruction in the presence of random LGT extends to a phylogenetic 
network setting. For background on phylogenetic networks, see e.g. [HRS10].

We note that there exist related lines of work in phylogenomics addressing the 
issue of incomplete lineage sorting [DR09] in the presence of gene transfers and 
hybridization events [TRIN07, JML09, Kub09, MK09, YTDN11, CA11] as well
as work on probabilistic models involving gene duplications and losses [ALS09, 
CM06]. 

The rest of the paper is organized as follows. In Section 2, we define a stochas- 
tic model of LGT and state our main results. A high-level description of our anal- 
ysis is given in Section 3. Finally in Section 4 we extend our results to highways 
of gene sharing. 

The results presented here were announced without proof in [RS12].

2 Model and Main Results 

Before stating our main results, we present a stochastic model of LGT. Roughly, 
following Galtier [Gal07], we assume that LGT events occur more or less at ran-
dom along the species phylogeny. Such a model appears to be consistent with
empirical evidence [GD08]. 

Notation Recall that, for functions $f(n), g(n)$, $f = O(g)$ means that there is a
constant $C > 0$ such that $f(n) \le C g(n)$ for all $n$ large enough. Similarly, $f =
\Omega(g)$ indicates $f(n) \ge C' g(n)$ for some constant $C' > 0$. In addition $f = \Theta(g)$ is equivalent to
$f = O(g)$ and $f = \Omega(g)$. By polynomial in $n$, we mean $O(n^{C''})$ for some constant
$C'' > 0$. We use the notation $\mathbb{P}[\mathcal{E}_0 \mid \mathcal{E}_1]$ for the conditional probability of $\mathcal{E}_0$ given
$\mathcal{E}_1$.

2.1 Stochastic Model of LGT 

Gene trees and species phylogeny A species phylogeny (or phylogeny for short) 
is a graphical representation of the speciation history of a group of organisms. The 
leaves correspond to extant or extinct species. Each branching indicates a speciation
event. Moreover we associate to each edge a positive value corresponding
to the time elapsed along that edge. For a tree $T = (V, E)$ with leaf set $L$ and
a subset of leaves $X \subseteq L$, we let $T|X$ be the restriction of $T$ to $X$, that is, the
subtree of $T$ where we keep only those vertices and edges on paths connecting
two leaves in $X$. We say that $T$ agrees (or is consistent) with $T|X$.
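The restriction operation can be illustrated with a minimal sketch. It assumes a rooted tree stored as nested tuples with hashable leaf labels; this representation and the function name `restrict` are hypothetical and not part of the paper. Unlike the definition above, the sketch also suppresses the degree-2 vertices left behind, which the text treats as a separate step.

```python
def restrict(tree, keep):
    """Restrict a nested-tuple tree to the leaves in `keep`.

    Leaves are any non-tuple labels; internal vertices are tuples of children.
    Vertices left with a single child are suppressed (assumption of this sketch).
    Returns None when no leaf of `keep` lies below `tree`.
    """
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kept_children = []
    for child in tree:
        restricted = restrict(child, keep)
        if restricted is not None:
            kept_children.append(restricted)
    if not kept_children:
        return None
    if len(kept_children) == 1:
        # Degree-2 vertex after pruning: suppress it.
        return kept_children[0]
    return tuple(kept_children)


species_tree = ((1, 2), (3, (4, 5)))
print(restrict(species_tree, {1, 3, 5}))
```

Here the leaves 2 and 4 are dropped and the two vertices that lose a child are suppressed, so the original tree agrees with the smaller tree on the three kept leaves.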

Definition 1 (Phylogeny) A (species) phylogeny $T_s = (V_s, E_s, L_s; r, \tau)$ is a rooted
tree with vertex set $V_s$, edge set $E_s$ and $n$ (labelled) leaves $L_s = [n] = \{1, \dots, n\}$
such that 1) the degree of all internal vertices $V_s - L_s$ is exactly 3 except the
root $r$ which has degree 2, and 2) the edges are assigned inter-speciation times
$\tau : E_s \to (0, +\infty)$. We assume that $T_s$ includes $n^+ > 0$ extant species $L_s^+$ and
$n^- \ge 0$ extinct species $L_s^-$, where $n = n^+ + n^-$. We also associate to each edge
$e \in E_s$ in $T_s$ a rate of lateral gene transfer $0 < \lambda(e) < +\infty$. We denote by
$T_s^+ = (V_s^+, E_s^+, L_s^+; r^+, \tau^+)$ the subtree of $T_s$ restricted to the extant leaves $L_s^+$,
that is, $T_s^+ = T_s|L_s^+$ rooted at the most recent common ancestor of $L_s^+$. We further
suppress vertices of degree 2 in $T_s^+$ except the root (in which case we add up the
branch lengths to obtain $\tau^+$). We call $T_s^+$ the extant phylogeny. We assume that
$T_s^+$ is ultrametric, that is, from every node, the path lengths from that node to all
its descendant leaves are equal.

Although we are ultimately interested in recovering the extant phylogeny, we 
include extinct species in the model as they can be involved in LGT events that 
affect the extant restriction of the tree. See e.g. [Mad97]. 

To infer the species phylogeny, we first reconstruct gene trees, that is, trees 
of ancestor-descendant relationships for orthologous genes or loci. Phylogenomic 
studies have revealed extensive discordance between such gene trees (e.g. [BSL+05, 
DB07]). 

Definition 2 (Gene tree) A gene tree $T_g = (V_g, E_g, L_g; \omega_g)$ for gene $g$ is an un-
rooted tree with vertex set $V_g$, edge set $E_g$ and $0 < n_g \le n$ (labelled) leaves
$L_g \subseteq \{1, \dots, n\}$ with $|L_g| = n_g$ such that 1) the degree of every internal vertex is
either 2 or 3, and 2) the edges are assigned branch lengths $\omega_g : E_g \to (0, +\infty)$.
We let $\mathcal{T}_g = \mathcal{T}[T_g]$ be the topology of $T_g$ where each internal vertex of degree 2 is
suppressed.

Remark 1 (Gene trees vs. species phylogeny) As we will discuss below, gene 
trees are derived from (or "evolve" on) the species phylogeny. They may differ
from the species phylogeny for various reasons. First, in our model, their branch 
lengths represent expected numbers of substitutions, instead of time elapsed. More-
over, their topology may differ as a result, in our case, of LGT events. See more
details below. 

Remark 2 (Rooted vs. unrooted) Our stochastic model of LGT requires a rooted 
species phylogeny as time plays an important role in constraining valid LGT 
events. See, e.g., [JNST09]. In particular our results rely on the ultrametricity 
property of the extant phylogeny. In contrast, branch lengths in gene trees corre- 
spond to expected numbers of substitutions. As a result, gene trees are typically 
unrooted and do not satisfy ultrametricity. 

Remark 3 (Taxon sampling) Each leaf in a gene tree corresponds to an extant 
species in the species phylogeny. However, because of gene loss and taxon sam- 
pling, a taxon may not be represented in every gene tree. 

Remark 4 (Branch lengths) Each branch $e$ in a gene tree $T_g$ corresponds to a
full or partial edge in the species phylogeny $T_s$. In particular, we allow internal
vertices of degree 2 in a gene tree to potentially delineate between two consecutive
species edges. We allow the branch lengths $\omega_g(e)$ to be arbitrary, but one could
easily consider cases where the branch lengths are determined by inter-speciation
times, lineage-specific rates of substitution and gene-specific rates of substitution.
The branch lengths will play a role in Section 5.

Random LGT We formalize a stochastic model of LGT similar to Galtier's [Gal07].
See also [KS01, Suc05, JNST06] for related models. The model accounts for LGT
events originating at random locations on the species phylogeny with LGT rate
$\lambda(e)$ prevailing along edge $e$.

We will need the following notation. Let $T_s = (V_s, E_s, L_s; r, \tau)$ be a fixed
species phylogeny. By a location in $T_s$, we mean any position along $T_s$ seen as a
continuous object (also called an R-tree), that is, a point $x$ along an edge $e \in E_s$. We
write $x \in e$ in that case. We denote the set of locations in $T_s$ by $\mathcal{X}_s$. For any two
locations $x, y$ in $\mathcal{X}_s$, we let $\mathrm{MRCA}(x, y)$ be their most recent common ancestor
(MRCA) in $T_s$ and we let $\tau(x, y)$ be the length of the path connecting $x$ and $y$ in
$T_s$ under the metric naturally defined by the weights $\{\tau(e), e \in E_s\}$, interpolated
linearly to locations along an edge. In words $\tau(x, y)$, which we refer to as the
$\tau$-distance between $x$ and $y$, is the sum of times to $x$ and $y$ from $\mathrm{MRCA}(x, y)$.
We say that two locations $x, y$ are contemporaneous if their respective $\tau$-distance
to the root $r$ is identical, that is,

$$\tau(x, r) = \tau(y, r).$$
For $R > 0$, we let

$$\mathcal{C}_x^{(R)} = \{ y \in \mathcal{X}_s : \tau(r, x) = \tau(r, y),\ \tau(x, y) \le 2R \}$$

be the set of locations contemporaneous to $x$ at $\tau$-distance at most $2R$ from $x$ (or
in other words with MRCA at $\tau$-distance at most $R$). In particular, $\mathcal{C}_x^{(\infty)}$ denotes
the set of all locations contemporaneous to $x$. We let $\Lambda(e) = \lambda(e)\tau(e)$, $e \in E_s$.
We note that, since $\lambda(e)$ is the LGT rate on $e$, $\Lambda(e)$ gives the expected number of
LGT events along $e$. Further, we let

$$\Lambda_{tot} = \sum_{e \in E_s} \Lambda(e)$$

be the total LGT weight of the phylogeny and

$$\Lambda = \sum_{e \in \mathcal{E}(T_s|L_s^+)} \Lambda(e)$$

be the total LGT weight of the extant phylogeny, where $\mathcal{E}(T_s|L_s^+)$ denotes the
edge set of $T_s|L_s^+$.

Our model of LGT is the following. Note first that, from a topological point 
of view, an LGT transfer is equivalent to a subtree-prune-and-regraft (SPR) op- 
eration [SS03]. The recipient location, that is, the location receiving the genetic 
transfer, is the point of pruning. Similarly, the donor location is the point of re- 
grafting. In other words, on the gene tree, a new internal node is created at the 
donor location with two children nodes, one being the original endpoint of the 
corresponding edge and the other being the node immediately under the recipient 
location in the species phylogeny. The original edge going to the latter node is 
removed. See Figure 1. 

Definition 3 (Random LGT) Let $0 < R \le +\infty$ possibly depending on $n$ (i.e.
not necessarily a constant) and note that we explicitly allow $R = +\infty$. Let
$T_s = (V_s, E_s, L_s; r, \tau)$ be a fixed species phylogeny. Let $0 < p \le 1$ be a sam-
pling effort probability. A gene tree topology $\mathcal{T}_g$ is generated according to the fol-
lowing continuous-time stochastic process which gradually modifies the species
phylogeny starting at the root. There are two components to the process:

1. LGT locations. The recipient and donor locations of LGT events are se-
lected as follows:
Figure 1: An LGT event. On the left, the species phylogeny is shown with the
donor (D) and recipient (R) locations. On the right, the resulting (unweighted) 
gene tree is shown after the LGT transfer. 

• Recipient locations. Starting from the root, along each branch $e$ of
$T_s$, locations are selected as recipient of a genetic transfer according
to a continuous-time Poisson process with rate $\lambda(e)$. Equivalently, the
total number of LGT events is Poisson with mean $\Lambda_{tot}$ and each such
event is located independently according to the following density. For
a location $x$ on branch $e$, the density at $x$ is $\lambda(e)/\Lambda_{tot}$.

• Donor locations. If $x$ is selected as a recipient location, the corre-
sponding donor location $y$ is chosen uniformly at random in $\mathcal{C}_x^{(R)}$, the set of
locations contemporaneous to $x$ at $\tau$-distance at most $2R$ from $x$. The
LGT transfer is then obtained by performing an SPR move from $x$ to
$y$, that is, the subtree below $x$ in $T_s$ is moved to $y$ in $T_s$. Note that we
perform genetic transfers chronologically from the root.

2. Taxon sampling. Each extant leaf is kept independently with probability $p$.
(One could also consider a different probability for each leaf. We use a fixed
sampling effort $p$ for simplicity.) The set of leaves selected is denoted by $L_g$.
The final gene tree $T_g$ is then obtained by keeping the subtree restricted to
$L_g$.

The resulting (random) gene tree topology is denoted by $\mathcal{T}_g$.
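The two random ingredients of the model that do not depend on the tree topology can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulator: edges are given as a hypothetical dictionary mapping an edge name to a pair (LGT rate, length), recipient locations are drawn edge by edge from a Poisson process using exponential inter-arrival times, and the donor choice and the SPR move itself are omitted. All function names are ours.

```python
import random


def total_lgt_weight(edges):
    """Sum of rate * length over all edges (the total LGT weight)."""
    return sum(rate * length for rate, length in edges.values())


def sample_recipient_locations(edges, rng):
    """Draw recipient locations as (edge name, offset along the edge).

    On each edge, events follow a Poisson process with that edge's rate,
    simulated through exponential gaps between consecutive events.
    """
    events = []
    for name, (rate, length) in edges.items():
        if rate <= 0:
            continue
        position = rng.expovariate(rate)
        while position < length:
            events.append((name, position))
            position += rng.expovariate(rate)
    return events


def sample_taxa(extant_leaves, p, rng):
    """Keep each extant leaf independently with probability p."""
    return {leaf for leaf in extant_leaves if rng.random() < p}


rng = random.Random(0)
edges = {"e1": (0.5, 2.0), "e2": (2.0, 1.0)}  # hypothetical toy values
print(total_lgt_weight(edges))
print(sample_recipient_locations(edges, rng))
print(sample_taxa({1, 2, 3, 4}, 0.5, rng))
```

Averaged over many runs, the number of sampled recipient locations should be close to the total LGT weight, which matches the statement that the number of LGT events is Poisson with that mean.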

When $R < +\infty$ a transfer can only occur between sufficiently closely related
species. One could also consider more general donor location distributions. See
e.g. [PWK10]. In Section 4, we consider a different form of preferential exchange,
highways of gene sharing.
2.2 Recovering the tree-like trend: Main results



Problem statement Let $T_s = (V_s, E_s, L_s; r, \tau)$ be an unknown species phy-
logeny. Using homologous gene sequences for every gene at hand, we generate $N$
independent gene tree topologies $\mathcal{T}_{g_1}, \dots, \mathcal{T}_{g_N}$ as above. Given the gene trees (or
their topologies), we seek to reconstruct the topology $\mathcal{T}_s^+ = \mathcal{T}[T_s^+]$ of the extant
phylogeny $T_s^+$. More precisely we are interested in the amount of LGT that can be
sustained without obscuring the phylogenetic signal. To derive asymptotic results 
about this question, we make some assumptions on the underlying phylogeny. We 
discuss two cases in detail. 

In practice, one estimates gene trees from sequence data. We come back to 
gene tree estimation issues below. 

Bounded-rates model The following assumption was introduced in [DR10] and
is related to a common assumption in the mathematical phylogenetics literature. 

Definition 4 (Bounded-rates model) Let $0 < \rho_\lambda < 1$ and $0 < \rho_\tau < 1$ be con-
stants. Let further $0 < \bar\tau < +\infty$ be a constant and $0 < \bar\lambda < +\infty$ be a value
possibly depending on $n^+$. Under the Bounded-rates model, we consider the set
of phylogenies $T_s = (V_s, E_s, L_s; r, \tau)$ with $n^+ > 0$ extant leaves and $n^- \ge 0$
extinct leaves and extant phylogeny $T_s^+ = (V_s^+, E_s^+, L_s^+; r^+, \tau^+)$ such that the fol-
lowing conditions are satisfied:

$$\underline{\lambda} = \rho_\lambda \bar\lambda \le \lambda(e) \le \bar\lambda, \quad \forall e \in E_s,$$

and

$$\underline{\tau} = \rho_\tau \bar\tau \le \tau^+(e^+) \le \bar\tau, \quad \forall e^+ \in E_s^+.$$

Our result in this case is the following. We use $\bar\lambda$ to control the amount of LGT
in the model.

Theorem 1 (Main result: Bounded-rates model, $R = +\infty$) Let $R = +\infty$. Un-
der the Bounded-rates model, it is possible to reconstruct the topology of the ex-
tant phylogeny with high probability (w.h.p.) from $N = \Omega(\log n^+)$ gene tree
topologies if $\bar\lambda$ is such that

$$\Lambda = O\left(\frac{n^+}{\log n^+}\right).$$
In words, we can reconstruct the species phylogeny w.h.p. as long as the ex-
pected number of LGT events $\Lambda$ (as measured on the extant phylogeny) per gene
is at most of the order of $n^+/\log n^+$. This result is based on a polynomial-time algo-
rithm we describe in Section 3. Note that, in typical phylogenomic studies, the
number of genes is much larger than the number of species. Therefore, our as-
sumption that the number of genes should be at least of the order of the logarithm
of the number of extant species is mild.

We also show that the bound on $\Lambda$ in Theorem 1 is close to optimal, up to
logarithmic factors.

Theorem 2 (Non-recoverability) Under the Bounded-rates model as above with
$N = \Theta(\log n^+)$, the topology of the extant phylogeny cannot, in general, be re-
constructed w.h.p. if $\bar\lambda$ is such that $\Lambda = \Omega(n^+ \log\log n^+)$.

More generally, the species phylogeny cannot be reconstructed from $N$ genes if
$\Lambda = \Omega(n^+ \log N)$. Theorem 2 is proved by a coupling argument [Lin92]. In
words we show that, with the order of $\Omega(n^+ \log\log n^+)$ expected LGT events,
there is insufficient signal from the gene trees to distinguish between two species
phylogenies with high probability.

Yule process Branching processes are commonly used to model species phylo-
genies [RY96]. In the continuous-time Yule process (or pure-birth process), one
starts with two species (representing the two branches emanating from the root).
At any given time, each species generates a new offspring at rate $0 < \nu < +\infty$.
We stop the process when the number of species is exactly $n + 1$ (and ignore the
$(n+1)$-st species). This process generates a species phylogeny with $n = n^+$ extant
species with branch lengths given by the inter-speciation times in the above pro-
cess. Note that $n^- = 0$ by construction. Let $0 < \rho_\lambda < 1$ be a constant. We also
assume that

$$\underline{\lambda} = \rho_\lambda \bar\lambda \le \lambda(e) \le \bar\lambda, \quad \forall e \in E_s,$$

for some $0 < \bar\lambda < +\infty$ possibly depending on $n$. As above, we use $\bar\lambda$ to control
the amount of LGT in the model.

An advantage of the Yule model is that, unlike the Bounded-rates model, it 
does not place arbitrary constraints on the inter-speciation times. In particular, the 
following analog of Theorem 1 suggests that our analysis does not rely on such 
constraints. 
Theorem 3 (Main result: Yule process, $R = +\infty$) Let $R = +\infty$. Under the
Yule model, the following holds with probability arbitrarily close to 1. It is possi-
ble to reconstruct the topology of the extant phylogeny w.h.p. from $N = \Omega(\log n)$
gene tree topologies if $\bar\lambda$ is such that

$$\Lambda = O\left(\frac{n}{\log n}\right).$$



Preferential LGT When $R < +\infty$, that is, when transfers occur only between
sufficiently related species, we obtain the following generalization which implies
that preferential LGT makes the tree-building problem easier.

Theorem 4 (Preferential LGT) Let $0 < R < \log n^+$ possibly depending on $n^+$.
Under the Bounded-rates model, it is possible to reconstruct the topology of the
extant phylogeny w.h.p. from $N = \Omega(\log n^+)$ gene tree topologies if $\bar\lambda$ is such that

$$\Lambda = O\left(\frac{n^+}{R}\right).$$

A similar result holds under the Yule model.



Further results We also obtain results on highways of LGT as well as sequence- 
length requirements. These results require additional background. See Sections 4 
and 5 respectively. 



3 Probabilistic Analysis 

We assume that we are given $N$ independent gene tree topologies $\mathcal{T}_{g_1}, \dots, \mathcal{T}_{g_N}$ as
above. Our goal is to reconstruct the extant phylogeny. 

Different algorithms are possible. A simple approach is to take a majority 
vote over all gene tree topologies. But this approach is problematic under taxon 
sampling and cannot sustain the high levels of LGT we consider below.

Instead we consider approaches that aggregate partial information over all 
gene trees. We focus on subtrees over four taxa whose topologies are called quar- 
tets [SS03]. We show that computationally efficient quartet-based approaches can 
sustain high levels of LGT. Although we prove our results for the specific method 
described below, our analysis is likely to apply to related methods. In Section 5.1, 
we also give a similar analysis for a distance-based method of Kim and Salis-
bury [KS01].
3.1 Algorithm

We consider the following approach related to an algorithm of Zhaxybayeva et
al. [ZGC+06]. Let $X = \{a, b, c, d\}$ be a four-tuple of extant species. The topology
$T|X$ of a tree $T$ restricted to $X$ can be summarized with a quartet split, or quartet
for short. There are three possible (resolved) quartets which we denote $q_1 = ab|cd$,
$q_2 = ac|bd$, and $q_3 = ad|bc$. We first compute the frequency of each quartet over
all gene trees displaying $X$, that is, over all gene trees $g$ such that $X \subseteq L_g$,

$$f_X(q_1) = \frac{|\{g_i : X \subseteq L_{g_i},\ \mathcal{T}_{g_i}|X = q_1\}|}{|\{g_i : X \subseteq L_{g_i}\}|},$$

and similarly for $q_2, q_3$. (We set the frequency to 0 if the denominator is 0.) For
each $X$, we choose the quartet with highest frequency (breaking ties arbitrarily).
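The plurality step can be sketched in a few lines. The sketch assumes gene tree topologies given as nested tuples (an arbitrary rooting of each unrooted tree), which is our representation and not the paper's; the function names are hypothetical. It relies on the fact that, in a binary tree, the quartet induced on four leaves is read off any clade containing exactly two of them. Since all three quartets of a four-tuple share the same denominator, picking the largest count is the same as picking the largest frequency. The final compatibility check and tree-building step are not included.

```python
from collections import Counter
from itertools import combinations


def leaf_sets(tree):
    """Return (all leaves below `tree`, list of leaf sets of every clade)."""
    if not isinstance(tree, tuple):
        single = frozenset([tree])
        return single, [single]
    leaves, clades = frozenset(), []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades


def induced_quartet(clades, four_taxa):
    """Quartet split induced on `four_taxa`, or None if unresolved."""
    four = frozenset(four_taxa)
    for clade in clades:
        pair = clade & four
        if len(pair) == 2:
            return frozenset([pair, four - pair])
    return None


def plurality_quartets(gene_trees, taxa):
    """For each four-tuple, the most frequent quartet among trees displaying it."""
    parsed = [leaf_sets(tree) for tree in gene_trees]
    chosen = {}
    for four in combinations(sorted(taxa), 4):
        votes = Counter()
        for leaves, clades in parsed:
            if set(four) <= leaves:  # the gene tree displays this four-tuple
                quartet = induced_quartet(clades, four)
                if quartet is not None:
                    votes[quartet] += 1
        if votes:
            chosen[four] = votes.most_common(1)[0][0]  # ties broken arbitrarily
    return chosen


gene_trees = [
    ((1, 2), (3, (4, 5))),
    ((1, 2), (3, (4, 5))),
    ((1, 3), (2, (4, 5))),  # one gene tree altered by a transfer
]
print(plurality_quartets(gene_trees, {1, 2, 3, 4, 5})[(1, 2, 3, 4)])
```

In this toy input two of the three gene trees support 12|34, so that quartet wins the vote for the four-tuple (1, 2, 3, 4) even though one gene tree disagrees.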

Definition 5 A set of quartets $Q = \{q_i\}$, with $L_{q_i}$ the leaf set of $q_i$, is compatible
if there is a tree $T$ with leaf set $L_Q = \bigcup_{q_i \in Q} L_{q_i}$ such that $T$ agrees with every $q_i$.

Quartet compatibility is, in general, NP-hard [Ste92]. However, when the set $Q$
covers all possible four-tuples of taxa (that is, exactly $\binom{n}{4}$ quartets on $n$ taxa with no repeated
four-tuple of taxa), there is a polynomial-time algorithm for compatibility [BD86,
Bun71, BG01]. In our procedure, for every four-tuple of taxa, there is a single
quartet chosen, so we can check compatibility easily and output the corresponding
tree. In practice, if $Q$ is not compatible, one can use instead a heuristic supertree
method such as MRP [Rag92, Bau92] or Quartet MaxCut [SR10, SR12].

The algorithm, which we call QuartetPlurality (QP), is detailed in Figure 2.



3.2 A general formula 

Our asymptotic analysis is based on the following claim. Recall that, for a subset
of extant species $X$, we let $T_s|X$ be the extant phylogeny restricted to
$X$ with corresponding edge set $\mathcal{E}(T_s|X)$. Also recall that $\Lambda(e) = \lambda(e)\tau(e)$ is the
expected number of LGT events on edge $e$ which we refer to as the LGT weight,
or weight for short, of $e$. Let

$$\Lambda_X = \sum_{e \in \mathcal{E}(T_s|X)} \Lambda(e)$$

be the total weight of the subtree $T_s|X$ under the weights $\Lambda(e)$, $e \in E_s$. Define
the maximum quartet weight (MQW) as

$$\Upsilon^{(4)} = \max\{\Lambda_X : X \subseteq (L_s^+)^4\}.$$
Algorithm QuartetPlurality

Input: Gene trees $g_1, \dots, g_N$;

Output: Estimated species phylogeny $\hat{T}$;

• Set $Q = \emptyset$.

• For all four-tuples of taxa $X = \{a, b, c, d\}$, letting $q_1 = ab|cd$, compute

$$f_X(q_1) = \frac{|\{g_i : X \subseteq L_{g_i},\ \mathcal{T}_{g_i}|X = q_1\}|}{|\{g_i : X \subseteq L_{g_i}\}|}$$

and similarly for $q_2 = ac|bd$ and $q_3 = ad|bc$. Add the quartet with highest
frequency (breaking ties arbitrarily) to $Q$.

• Using Buneman's algorithm [Bun71] compute the tree $\hat{T}$ compatible with $Q$
(or abort if no such tree is found).

• Output $\hat{T}$.

Figure 2: Algorithm QuartetPlurality.

Lemma 1 (Probability of a miss) Let $\mathcal{T}_g$ be a gene tree topology distributed ac-
cording to the random LGT model such that $X = \{a, b, c, d\} \subseteq L_g$. Let $q_g^X$
(respectively $q_s^X$) be the quartet corresponding to $\mathcal{T}_g|X$ (respectively $T_s|X$). Then

$$\mathbb{P}[q_g^X = q_s^X \mid X \subseteq L_g] \ge \exp\left(-\Upsilon^{(4)}\right).$$

Recall that $\Lambda$ is the expected number of LGT events (as measured on the extant
phylogeny) per gene. As a comparison, note that the probability that a gene tree
is LGT-free is $e^{-\Lambda}$, which can be much smaller.

Proof (Lemma 1): We first note that, by our assumption that the species phy-
logeny is bifurcating, $q_s^X$ is resolved. Similarly $q_g^X$ is resolved because under a
Poisson process for the recipient location the probability that a vertex has degree
higher than 2 (that is, that a pruning and re-grafting occurs exactly at the location
of an existing vertex) is 0.

Now we observe that if none of the recipient locations lands on $T_s|X$ then the
corresponding quartet remains intact. Indeed an SPR move can only (potentially)
affect those quartets with at least one leaf in the pruned subtree, and this happens
with probability [...]. The claim then follows by induction on the number of LGT
events.

Hence the probability that $q_g^X = q_s^X$ is at least the probability that all LGT
events (on the extant phylogeny) miss $T_s|X$, which is at least

$$\exp(-\Lambda_X) \ge \exp\left(-\Upsilon^{(4)}\right).$$

■
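To see how different the two probabilities in the comparison above can be, the following sketch evaluates them on a toy case. The toy case is our assumption: a complete binary species tree on `n_plus` extant leaves (a power of two, no extinct species) in which every edge has the same length `tau_bar` and the same LGT rate `lam_bar`, so that a quartet subtree has at most four root-to-leaf paths of `log2(n_plus)` edges each. The function name is hypothetical.

```python
import math


def quartet_vs_lgt_free(n_plus, lam_bar, tau_bar=1.0):
    """Return (lower bound on P[quartet intact], P[gene tree is LGT-free]).

    Toy setting: complete binary tree, equal edge lengths and LGT rates.
    """
    depth = math.log2(n_plus)
    # At most 4 root-to-leaf paths, each with `depth` edges of weight lam_bar * tau_bar.
    quartet_weight_bound = 4 * depth * lam_bar * tau_bar
    # A rooted complete binary tree on n_plus leaves has 2 * n_plus - 2 edges.
    total_weight = (2 * n_plus - 2) * lam_bar * tau_bar
    return math.exp(-quartet_weight_bound), math.exp(-total_weight)


p_quartet, p_lgt_free = quartet_vs_lgt_free(1024, 0.01)
print(p_quartet, p_lgt_free)
```

With 1024 leaves and rate 0.01 per unit time, a given quartet survives with probability above one half, while the chance that the whole gene tree escapes every transfer is negligible. This gap is what allows quartet aggregation to succeed where whole-tree voting fails.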



3.3 Bounded-rates and Yule models 

Next we argue that, under appropriate assumptions on the species phylogeny, the
maximum quartet weight is bounded in such a way that the plurality quartet topol-
ogy for every four-tuple of taxa $X = \{a, b, c, d\}$, which we denote by $q_*^X$, satisfies
$q_*^X = q_s^X$, where $q_s^X$ is the quartet of the species phylogeny on $X$. As a result, our
quartet set is compatible and $\mathcal{T}_s^+$ can be reconstructed efficiently.

3.3.1 Bounded-rates model 

We bound the maximum quartet weight $\Upsilon^{(4)}$ in the Bounded-rates model.

Lemma 2 (Bound on quartet weight: Bounded-rates case) Under the Bounded-
rates model it holds that

$$\Upsilon^{(4)} = O(\bar\lambda \log n^+), \qquad \Lambda = \Theta(\bar\lambda n^+).$$

Proof (Lemma 2): The first part of the proof is taken from [DR10]. Let $h$ (respec-
tively $H$) be the smallest (respectively largest) number of edges on a path between
the root and an extant leaf. Because the number of extant leaves is $n^+$ and the ex-
tant phylogeny is bifurcating (recall that we suppressed vertices of degree 2 after
taking a restriction to the extant species), we must have $2^h \le n^+$ and $2^H \ge n^+$.
Since all extant leaves are contemporaneous it must be that $H\underline{\tau} \le h\bar\tau$ (recall that
$\underline{\tau} = \rho_\tau \bar\tau$). Combining these constraints gives

$$\frac{\underline{\tau}}{\bar\tau} \log_2 n^+ \le h \le H \le \frac{\bar\tau}{\underline{\tau}} \log_2 n^+.$$

Hence

$$\max\{\Lambda_X : X \subseteq (L_s^+)^4\} \le 4 \bar\lambda \frac{\bar\tau^2}{\underline{\tau}} \log_2 n^+.$$

The total number of edges in the extant phylogeny is $2n^+ - 3$ so that

$$\Lambda = \Theta(\bar\lambda n^+).$$

■

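Using Lemma 2, we prove Theorem 1. First recall the following standard
concentration inequality (see e.g. [MR95]):

Lemma 3 (Azuma-Hoeffding Inequality) Suppose $Z = (Z_1, \dots, Z_m)$ are inde-
pendent random variables taking values in a set $S$, and $h : S^m \to \mathbb{R}$ is any
$t$-Lipschitz function: $|h(z) - h(z')| \le t$ whenever $z, z' \in S^m$ differ at just one
coordinate. Then, $\forall \zeta > 0$,

$$\mathbb{P}[|h(Z) - \mathbb{E}[h(Z)]| \ge \zeta] \le 2\exp\left(-\frac{\zeta^2}{2 t^2 m}\right).$$

Proof (Theorem 1): Consider the quartet-based approach described in Section 3.1.
Take $\bar\lambda = C_1/\log n^+$ with $C_1 > 0$ small enough, so that $\Lambda = O(n^+/\log n^+)$ by
Lemma 2. By the taxon sampling step and Lemmas 1 and 2, we have for any
four-tuple $X$ of extant species

$$\mathbb{P}[X \subseteq L_g] = p^4$$

and

$$\mathbb{P}[q_g^X = q_s^X \mid X \subseteq L_g] \ge \exp\left(-\Upsilon^{(4)}\right) \ge \exp(-O(C_1)),$$

which exceeds $1/2$ by a constant margin for $C_1$ small enough. We choose $C_2 > 0$ large enough with

$$N \ge C_2 \log n^+,$$

and $\varepsilon < p^4$ so that, using Lemma 3, the following inequalities hold. Consider the
following events

$$\mathcal{E}_0 = \left\{ \left| |\{g_i : X \subseteq L_{g_i}\}| - N p^4 \right| \le N\varepsilon \right\}$$

and

$$\mathcal{E}_1 = \left\{ |\{g_i : X \subseteq L_{g_i},\ \mathcal{T}_{g_i}|X = q_s^X\}| > \frac{1}{2} |\{g_i : X \subseteq L_{g_i}\}| \right\}.$$

By Lemma 3,

$$\mathbb{P}[\mathcal{E}_0^c] \le \exp\left(-\Omega(\varepsilon^2 N)\right),$$

and

$$\mathbb{P}[\mathcal{E}_1^c \mid \mathcal{E}_0] \le \exp\left(-\Omega(N(p^4 - \varepsilon))\right).$$

Hence, for a constant $C_2$ large enough,

$$\mathbb{P}[f_X(q_s^X) \le 1/2] \le \mathbb{P}[\mathcal{E}_0^c] + \mathbb{P}[\mathcal{E}_1^c \mid \mathcal{E}_0],$$

and, since $N \ge C_2 \log n^+$, the right-hand side is at most $(n^+)^{-\Omega(C_2)}$. Taking a union
bound over the at most $(n^+)^4$ four-tuples, w.h.p. the plurality vote is correct for every
four-tuple of taxa and the extant phylogeny is correctly reconstructed. ■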
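The logarithmic dependence of the number of genes on the number of species can be made concrete with a back-of-the-envelope calculation. The sketch below is not the paper's argument: it ignores taxon sampling (it counts only genes that display the four-tuple), assumes each such gene independently shows the species quartet with probability at least `q > 1/2`, and uses the classical Hoeffding bound exp(-2 M (q - 1/2)^2) for M independent votes together with a union bound over all four-tuples. The function name and parameters are hypothetical.

```python
import math


def genes_displaying_needed(n_plus, q, delta):
    """Genes per four-tuple so that every majority vote is right w.p. >= 1 - delta.

    Assumes independent votes, each correct with probability at least q > 1/2.
    """
    if not 0.5 < q <= 1.0:
        raise ValueError("q must lie in (1/2, 1]")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    num_four_tuples = math.comb(n_plus, 4)
    margin = q - 0.5
    # Solve num_four_tuples * exp(-2 * M * margin**2) <= delta for M.
    return math.ceil(math.log(num_four_tuples / delta) / (2 * margin ** 2))


for n_plus in (100, 1000, 10000):
    print(n_plus, genes_displaying_needed(n_plus, q=0.75, delta=0.01))
```

Multiplying the number of species by one hundred only roughly doubles the requirement in this toy calculation, in line with the assumption that the number of genes need only grow like the logarithm of the number of extant species.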

3.3.2 Yule process 

We now consider the Yule model. 

Lemma 4 (Bound on quartet weight: Yule case) Under the Yule model, it holds
that

$$\Upsilon^{(4)} = \Theta(\bar\lambda \log n), \qquad \Lambda = \Theta(\bar\lambda n)$$

with probability approaching 1 as $n \to +\infty$.

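Proof (Lemma 4): We consider a pure-birth process with birth rate $\nu$ starting
from 2 species. For background on branching processes see [AN72].

Let $Z_i$ be the $(i-1)$-th inter-speciation time. As a minimum of $i$ independent
exponential distributions with mean $1/\nu$, $Z_i$ is an exponential with mean $1/(i\nu)$.
Moreover the $Z_i$'s are independent. Hence the height of the phylogeny in time
units, that is, the total time until $n+1$ species are present (recall that we ignore
the $(n+1)$-st species) is

$$Z = \sum_{i=2}^{n+1} Z_i,$$

and we have

$$\mathbb{E}[Z] = \sum_{i=2}^{n+1} \mathbb{E}[Z_i] = \sum_{i=2}^{n+1} \frac{1}{i\nu} = \Theta(\nu^{-1} \log n),$$

and

$$\mathrm{Var}[Z] = \sum_{i=2}^{n+1} \mathrm{Var}[Z_i] = \sum_{i=2}^{n+1} \frac{1}{i^2\nu^2} = \Theta(\nu^{-2}).$$

The total weight of the phylogeny in time units

$$Y = \sum_{i=2}^{n+1} i Z_i$$

is a sum of $n$ independent exponential random variables with parameter $\nu$, and we
have

$$\mathbb{E}[Y] = \sum_{i=2}^{n+1} i\,\mathbb{E}[Z_i] = \sum_{i=2}^{n+1} \frac{i}{i\nu} = \nu^{-1} n,$$

and

$$\mathrm{Var}[Y] = \sum_{i=2}^{n+1} i^2\,\mathrm{Var}[Z_i] = \sum_{i=2}^{n+1} \frac{i^2}{i^2\nu^2} = \nu^{-2} n.$$

By Chebyshev's inequality,

$$\mathbb{P}[Z \ge C_3 \nu^{-1} \log n] \le \frac{\mathrm{Var}[Z]}{(C_3 \nu^{-1} \log n - \mathbb{E}[Z])^2} = O\left(\frac{1}{\log^2 n}\right) \to 0,$$

and

$$\mathbb{P}[Y \ge C_3 \nu^{-1} n] \le \frac{\mathrm{Var}[Y]}{(C_3 \nu^{-1} n - \mathbb{E}[Y])^2} = O\left(\frac{1}{n}\right) \to 0,$$

for appropriately chosen $C_3$ not depending on $n$. The same holds in the other
direction so that $\Upsilon^{(4)} = \Theta(\bar\lambda \log n)$ and $\Lambda = \Theta(\bar\lambda n)$ with probability approaching
1. ■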
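The two moment computations in this proof are easy to check by simulation. The sketch below draws the inter-speciation times directly as independent exponentials, with the $i$-th one having rate $i\nu$ for $i = 2, \dots, n+1$ as in the sums above, and returns the height and the total length. It does not build the tree itself; the function name and the parameter values are hypothetical.

```python
import random


def yule_height_and_length(n, nu, rng):
    """Sample (height Z, total length Y) from inter-speciation times.

    Z is the sum of Z_i and Y the sum of i * Z_i for i = 2..n+1,
    where Z_i is exponential with rate i * nu.
    """
    height = 0.0
    total_length = 0.0
    for i in range(2, n + 2):
        z_i = rng.expovariate(i * nu)  # mean 1 / (i * nu)
        height += z_i
        total_length += i * z_i
    return height, total_length


rng = random.Random(1)
n, nu, runs = 200, 1.0, 2000
samples = [yule_height_and_length(n, nu, rng) for _ in range(runs)]
mean_height = sum(z for z, _ in samples) / runs
mean_length = sum(y for _, y in samples) / runs
expected_height = sum(1.0 / (i * nu) for i in range(2, n + 2))
print(mean_height, expected_height)
print(mean_length, n / nu)
```

The empirical mean height tracks the harmonic sum, which grows like $\log n$, while the empirical mean total length tracks $n/\nu$. These are the two scalings that the lemma turns into bounds on the maximum quartet weight and on the total LGT weight.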

Proof (Theorem 3): Using Lemma 4, the proof of Theorem 3 follows along the
same lines as that of Theorem 1. ■

3.4 Preferential LGT 

We now prove Theorem 4. 

Proof (Theorem 4): The proof is similar to that of Theorems 1 and 3. The main
difference is in the proof of Lemma 1. In that proof, note that if $R < +\infty$ then for
an LGT to affect the quartet on $X$, it must be that not only 1) the recipient location
lands on $T_s|X$, but also 2) that it lands on a location below either branching of the
corresponding quartet tree within time $R$ of the branching point. Indeed these are
the only locations where the corresponding leg of the quartet tree can potentially
jump to a subtree corresponding to a different leg. (In fact, it must be that a leg
on the other side of the internal branch of the quartet tree is within time $2R$.)
The length of this region is at most $4R$ in $\tau$-distance. Hence in the bound on the
probability of a miss we get

$$\mathbb{P}[q_g^X = q_s^X \mid X \subseteq L_g] \ge \exp\left(-\min\{\Upsilon^{(4)}, 4R\bar\lambda\}\right).$$

The result then follows. ■

3.5 Non-recoverability 

We now prove Theorem 2. 

Proof (Theorem 2): We use a coupling argument [Lin92]. Fix $\delta > 0$ small.
We construct two species phylogenies with different topologies which, with
probability $1 - \delta$, cannot be distinguished from $N$ gene tree topologies
when the total expected amount of LGT $\Lambda$ is of the order of
$n^+ \log\log n^+$ per gene. In particular the reconstruction problem cannot be
solved in that case. The idea of a coupling is to run the stochastic processes of
LGT on both phylogenies simultaneously so as to output the same gene trees with
high probability without changing the marginal distributions (that is, the
probability distributions of gene tree topologies on each phylogeny separately).

We proceed as follows. Consider a complete binary tree $T'_s$ on a set of $n$
leaves (all extant) and denote the four children at height 2 from the root as
$a, b, c, d$, where $a$ and $b$ are sisters and so are $c$ and $d$. For each
$z \in \{a, b, c, d\}$, the subtree rooted at $z$ has $n/4$ leaves. Moreover, for
simplicity, assume all edges of $T'_s$ have the same LGT weight. From $T'_s$ we
construct $T''_s$ by rewiring the four nodes $\{a, b, c, d\}$ such that $a$ is
now sister with $c$ and $b$ with $d$.

We generate $N = \Theta(\log n)$ gene trees on each of $T'_s$ and $T''_s$ as
follows. We run the stochastic process of LGT on $T'_s$ as described in
Definition 3. Let $T'_{g_1}, \ldots, T'_{g_N}$ be the gene tree topologies so
obtained. For $T''_s$ and every gene, we use exactly the same LGT events as the
ones generated on $T'_s$, where we identify the two edges adjacent to the roots
in $T'_s$ and $T''_s$ arbitrarily. Let $T''_{g_1}, \ldots, T''_{g_N}$ be the
gene tree topologies so obtained.

Since $T'_s$ and $T''_s$ are identical below every $z \in \{a, b, c, d\}$ and
LGT events occur only between contemporaneous points, the subtrees under
$\{a, b, c, d\}$ in $T'_{g_i}$ and $T''_{g_i}$ are identical for every gene $i$.
Figure 3: Good event.

For $z \in \{a, b, c, d\}$, let $e_z$ be the edge adjacent to $z$ and above it
in $T'_s$ (and in $T''_s$). It remains to show that, for $T'_{g_i}$ and
$T''_{g_i}$ to be identical under the joint construction above, it suffices that
the following good event occurs: three consecutive LGT moves start on the same
edge in $e_a, \ldots, e_d$ (donor location) and land on the other three edges in
$e_a, \ldots, e_d$ (recipient location), for example, $a \to d$, $a \to c$,
$a \to b$. See Figure 3. Indeed, in that case, the first donor location above
becomes the common ancestor to all nodes in the gene trees. From that point on,
we obtain the same gene tree for both phylogenies.

We claim that the probability that the good event does not occur is
$O(1/\log n)$. Under the assumption that $\Lambda = \Omega(n \log\log n)$ and
that the LGT weights are equal, the number of LGT events on any edge is Poisson
with mean $\Omega(\log\log n)$. Consider the time interval between the nodes at
height 1 from the root and the nodes at height 2. Divide this interval into
$u = O(\log\log n)$ equal subintervals $I_1, \ldots, I_u$ such that the number of
LGT events on edge $e_z$ in $I_j$ is Poisson with mean $C_0$ for some constant
$C_0 > 0$. In $I_j$ the probability that there is no LGT event originating from
$e_b, \ldots, e_d$ and that there are exactly three LGT events originating from
$e_a$ and landing on $e_b, e_c, e_d$ in that order is a constant $c > 0$ not
depending on $n$.

The subintervals are independent. The probability that the event above does not
happen in any of $I_1, \ldots, I_u$ is

$$(1 - c)^u = O\left(\frac{1}{\log n}\right).$$

This gives an upper bound of $O(1/\log n)$ on the probability that the good event
does not happen.

Therefore, by a union bound over the genes, the probability that the good event
does not occur on at least one gene tree is
$\Theta(\log n) \cdot O(1/\log n) = O(1)$, which is at most $\delta$ if the
constant in $\Lambda$ is large enough. If the good event occurs on every gene
tree, then both phylogenies output the exact same set of gene tree topologies.
That concludes the proof. ■
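
To make the orders of magnitude in this argument concrete, here is a small
numerical sketch. It is not part of the proof. The per-subinterval probability
below relies on an assumption that is not stated in the text, namely that each
LGT event leaving the donor edge lands on one of the three other edges with
probability 1/3 each. The names `good_event_prob`, `miss_prob` and
`subintervals_needed` are illustrative.

```python
import math


def good_event_prob(c0):
    """Probability of the good event within one subinterval.

    Each of the four edges originates a Poisson(c0) number of LGT events.
    ASSUMPTION (not from the source): each event leaving the donor edge
    lands on one of the three other edges uniformly at random.
    """
    p_quiet = math.exp(-3 * c0)             # no event from the 3 other edges
    p_three = math.exp(-c0) * c0 ** 3 / 6   # exactly 3 events from the donor
    p_order = (1 / 3) ** 3                  # three targets hit in a fixed order
    return p_quiet * p_three * p_order


def miss_prob(c, u):
    """Probability that the good event fails in all u independent subintervals."""
    return (1 - c) ** u


def subintervals_needed(c, n):
    """Smallest u with (1 - c)^u <= 1 / log(n); it grows like log log n."""
    return math.ceil(math.log(math.log(n)) / -math.log(1 - c))
```

Whatever the exact value of the constant $c$, the number of subintervals needed
to push the failure probability below $1/\log n$ grows only like $\log\log n$,
which is why an LGT weight of order $n \log\log n$ is enough for the coupling.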

4 Highways of LGT 

In this section, we add highways of gene sharing to the model. Highways are, 
in essence, non-random patterns of LGT [BHR05]. These can potentially take 
different shapes. Following Bansal et al. [BBGS11], we focus on pairs of edges
in the phylogeny that undergo an unusually large number of LGT events between 
them. 

We give two results. As long as the frequency of genes affected by highways is 
low enough, the species phylogeny can be reconstructed using the same approach 
as in Section 3. Moreover, with extra assumptions on the positions of the highways 
with respect to each other, the highways themselves can be inferred. 

In this section, we assume $n^- = 0$.

4.1 Model 

We generalize our model of LGT as follows. 

Definition 6 (Highways of LGT) Let $T_s = (V_s, E_s, L_s; r, \tau)$ be a species
phylogeny with LGT rates $0 < \lambda(e) < +\infty$, $e \in E_s$, and let
$0 < p \leq 1$ be a taxon sampling probability. Assume $n^- = 0$. For
$\beta = 1, \ldots, B$, let $H_\beta = (e_{\beta,0}, e_{\beta,1})$ be a pair of
edges in $T_s$ which share contemporaneous locations. We call $H_\beta$ a
highway. Let $g_1, \ldots, g_N$ be $N$ genes. Each highway $H_\beta$ involves a
subset $G_\beta$ of the genes. If gene $g_i \in G_\beta$, then it undergoes an
LGT event between a pair of contemporaneous locations, one in $e_{\beta,0}$ and
one in $e_{\beta,1}$. We let $\gamma_\beta$ be the fraction of genes such that
$g_i \in G_\beta$ and we assume that $\gamma_\beta > \underline{\gamma}$ for some
$\underline{\gamma}$ (chosen below). In addition, independently from the above,
we assume that each gene undergoes LGT events at random locations as described in
Definition 3. We denote by $T_{g_1}, \ldots, T_{g_N}$ the gene tree topologies so
obtained.

Remark 5 (Deterministic setting) Note that the highways and which genes are
involved in them are deterministic in this setting. Only the random LGT events are
governed by a stochastic process. Note moreover that we allow highway events to
go in either direction, that is, from $e_{\beta,0}$ to $e_{\beta,1}$ or vice versa.



4.2 Building the species tree in the presence of highways 

We first prove that the species phylogeny can still be reconstructed in the presence 
of highways as long as the fraction of genes involved in highways is low enough. 
We only discuss the Bounded-rates model with $R = +\infty$.

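Theorem 5 (Highways of LGT) Consider the Bounded-rates model with
$R = +\infty$ and assume that $B < +\infty$ is constant. Assume further that
there is a constant $0 < \overline{\gamma} < 1$ such that

$$\gamma_\beta < \overline{\gamma}, \qquad \beta = 1, \ldots, B.$$

If

$$\overline{\gamma} < \frac{1}{2B},$$

then it is possible to reconstruct the topology of the extant phylogeny w.h.p.
from $N = \Omega(\log n^+)$ gene tree topologies if $\Lambda$ is such that

$$\Lambda = O\left(\frac{n^+}{\log n^+}\right).$$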

Proof (Theorem 5): The proof is similar to that of Theorem 1. Note that a quartet
tree in the species phylogeny can be affected by a highway in at most a fraction
$B\overline{\gamma} < 1/2$ of the genes. Moreover by the proof of Lemma 1,
choosing $C_1$ small enough, a quartet tree is affected by a random LGT event in
an arbitrarily small fraction of genes. Therefore the plurality vote will
reconstruct the correct split with high probability. The result follows. ■
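
The plurality argument can be illustrated on a single four-tuple. The sketch
below is hypothetical: it assumes each gene tree restricted to $X$ has already
been summarized as a canonical split string such as "ab|cd", and the function
`plurality_quartet` is an illustration, not the QuartetPlurality routine itself.

```python
from collections import Counter


def plurality_quartet(splits):
    """Return the most frequent quartet split and its empirical frequency.

    `splits` holds one split per gene tree containing the four-tuple,
    written in a canonical form such as "ab|cd".
    """
    if not splits:
        raise ValueError("four-tuple is not covered by any gene tree")
    counts = Counter(splits)
    split, votes = counts.most_common(1)[0]
    return split, votes / len(splits)


# Toy data: 100 genes, 30 rerouted by a highway, 5 hit by random LGT.
observed = ["ab|cd"] * 65 + ["ac|bd"] * 30 + ["ad|bc"] * 5
winner, freq = plurality_quartet(observed)
```

In this toy example the species split keeps a strict majority because the
highway and the random LGT events together affect fewer than half of the genes.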
4.3 Inferring highways



The problem of inferring the highway locations is essentially a network recon- 
struction problem. Such problems are often computationally intractable. See 
e.g. [HRS10]. Therefore, we require some extra assumptions. Our goal here is
not to provide the most general result but rather to illustrate that our analysis ex- 
tends naturally to certain network settings. The following assumption is related to 
so-called galled trees. 

Assumption 1 We assume that no highway connects two edges in $T_s$ separated
by fewer than two edges or edges adjacent to root edges. (Such cases cannot be
reconstructed.) Seen as an edge superimposed on $T_s$, a highway $H_\beta$
forms a cycle. We assume that all such cycles are disjoint, that is, they do not
share common locations.

We then prove the following. We use a computationally efficient algorithm, which 
we call RoadRoller, described in Figure 4 and explained in the proof.

Theorem 6 (Inferring highways) Consider the Bounded-rates model with
$R = +\infty$ and assume that $B < +\infty$ is constant. Assume further that
there are constants $0 < \underline{\gamma} < \overline{\gamma} < +\infty$ such
that

$$\underline{\gamma} < \gamma_\beta < \overline{\gamma}, \qquad \beta = 1, \ldots, B.$$

If

$$\overline{\gamma} < \frac{1}{2}$$

and Assumption 1 holds then it is possible to reconstruct the topology of the
extant phylogeny as well as the highway edges w.h.p. from
$N = \Omega(\log n^+)$ gene tree topologies if $\Lambda$ is such that

$$\Lambda = O\left(\frac{n^+}{\log n^+}\right).$$

Proof (Theorem 6): Consider a four-tuple $X$ such that $T_s|X$ contains at least
one highway location and such that the quartet on $X$ is modified by the
corresponding highway. Because such a highway must connect a leg of $T_s|X$ to a
subtree on the other side of the internal branch of $T_s|X$, our galled tree
assumption implies that any given quartet tree can be affected by at most one
highway, otherwise the corresponding cycles would intersect along the internal
branch. Hence, from the
Algorithm RoadRoller

Input: Gene trees $g_1, \ldots, g_N$;

Output: Estimated species phylogeny $\hat{T}$ and highway locations;

• Use QuartetPlurality to reconstruct the species phylogeny $\hat{T}$. Let $Q$
be the set of all quartets whose estimated frequency is less than 1/2 but more
than $\underline{\gamma}/2$.

• For all pairs of four-tuples $X \neq X'$ (possibly sharing taxa) with a
corresponding quartet in $Q$,

- Find the shared edges $e(X, X')$ along the internal branches of $T_s|X$ and
$T_s|X'$.

- Let $X \sim X'$ if $e(X, X') \neq \emptyset$.

• Build the graph $\mathcal{Q}$ corresponding to $\sim$ with vertex set being
all $X$s with a corresponding quartet in $Q$.

• For each connected component $W$ of $\mathcal{Q}$,

- Compute the union $V$ of all $e(X, X')$ over pairs $X$ and $X'$ in $W$. Abort
if $V$ is not a path.

- Let $e_0^W$ and $e_1^W$ be the start and end edges on the path $V$.

- For $i = 0, 1$, let $e_i^-$ and $e_i^+$ be the edges adjacent to $e_i^W$.

- For each pair with one element in $\{e_0^-, e_0^+\}$ and one element in
$\{e_1^-, e_1^+\}$, determine whether each $T_s|X$ with $X$ in $W$ contains at
least one element in the pair.

- If only one pair passed the previous test,

* Denote the pair by $(e_0^H, e_1^H)$,

* Else, let $e_1^H$ be the intersection of the pairs found (abort if the
intersection does not contain a unique element), choose an $X$ in $W$ such that
$T_s|X$ includes all of $\{e_0^-, e_0^+\}$ and $\{e_1^-, e_1^+\}$, and use the
corresponding quartet in $Q$ to determine the sister leaf to the leaf below
$e_1^H$.

• The latter leaf is below edge $e_0^H$ among $\{e_0^-, e_0^+\}$.

• Output $\hat{T}$ and $(e_0^H, e_1^H)$ for all $W$.

Figure 4: Algorithm RoadRoller.
proof of Theorem 5 and the assumption that $\overline{\gamma} < 1/2$ (instead of
$\overline{\gamma} < 1/(2B)$), we can reconstruct the extant phylogeny.

Further, it follows by the proof of Theorem 5 that, if $\underline{\gamma} > 0$
and $C_1$ is small enough, the second most frequent quartet over a four-tuple as
above is the one obtained by going through the highway. Let $Q$ be the set of all
quartets whose estimated frequency is less than 1/2 but more than
$\underline{\gamma}/2$. By the previous argument and Lemma 3 (see the proof of
Theorem 1 for a similar computation), $Q$ contains w.h.p. exactly those quartets
affected by a highway.

For $X, X'$ with quartets in $Q$, write $X \sim X'$ if the quartet trees $T_s|X$
and $T_s|X'$ share an edge along their internal branch. Let $e(X, X')$ be the set
of all such shared edges. Note that, although we are considering four-tuples
affected by highways, we are working on the species phylogeny $T_s$ which has
been reconstructed.

By the argument above, quartets sharing an edge along their internal branch
are necessarily affected by the same highway. Take the transitive closure
$\sim^*$ of $\sim$. Let $W$ be an equivalence class of $\sim^*$. We reconstruct
the corresponding highway as follows. The union of all edges in $e(X, X')$ for
some pair $X, X'$ in $W$ forms a path $V$ in $T_s$. Let $e_0^W$ and $e_1^W$ be
the start and end edges on this path. The highway corresponding to $W$ connects
an edge adjacent to $e_0^W$ with an edge adjacent to $e_1^W$. See Figure 5.
(Note that a highway is represented by exactly one $W$ because w.h.p. all
quartets affected by this highway are in $Q$ and they are all connected under
$\sim$.)

As we argued in the proof of Lemma 1, all quartets affected by the highway
corresponding to $W$ contain at least one leaf in a pruned subtree. Because we
allow LGT events in both directions along a highway, there are two potential
pruned subtrees. Moreover, the other three leaves must be in separate subtrees
hanging from the path $V$. By our assumption, there are at least three such
subtrees (in addition to the two potentially pruned subtrees).

Hence, the pruned subtrees can be identified by checking the four-tuples in $W$
and finding the pairs of subtrees with at least one of them present in all of
$W$. If there is a unique such pair, this gives the two highway edges and we are
done. Otherwise, the recipient edge is the intersection of the pairs found. To
identify the donor edge, one simply needs to use a four-tuple $X$ of leaves in
the four adjacent subtrees to the endpoints of $V$ and check to which branch of
$T_s|X$ the subtree corresponding to the recipient edge is moved in $Q$ (that is,
in the highway-affected quartet topology). ■
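
The grouping step of RoadRoller amounts to a connected-components computation.
The following sketch is an illustration under simplifying assumptions, not the
implementation analyzed above: the internal-branch edge sets of the quartet
trees $T_s|X$ are assumed to be precomputed and passed in as Python sets of
hypothetical edge labels, and the check that the union of shared edges forms a
path is omitted.

```python
from itertools import combinations


def group_four_tuples(internal_edges):
    """Group four-tuples whose internal branches share an edge.

    `internal_edges` maps each four-tuple X (a tuple of taxa) with a quartet
    in Q to the set of species-tree edges on the internal branch of T_s|X.
    Returns a list of (members, shared) pairs, one per connected component,
    where `shared` is the union of e(X, X') over pairs in the component.
    """
    parent = {x: x for x in internal_edges}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for x, y in combinations(internal_edges, 2):
        if internal_edges[x] & internal_edges[y]:  # X ~ X'
            parent[find(x)] = find(y)

    components = {}
    for x in internal_edges:
        components.setdefault(find(x), []).append(x)

    result = []
    for members in components.values():
        shared = set()
        for x, y in combinations(members, 2):
            shared |= internal_edges[x] & internal_edges[y]
        result.append((sorted(members), shared))
    return result
```

A component containing a single four-tuple has an empty set of shared edges in
this sketch, so a caller would need to handle that case separately before
looking for the start and end edges of the path.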
Figure 5: Setup in the proof of Theorem 6. The grey arrow indicates a highway.
Here $X = \{a, b, c, d\}$, $T_s|X = ab|cd$ and $bc|ad \in Q$.

5 Distance method and sequence lengths 

In this section, in the highway-free case, we analyze an alternative, distance-based 
approach that has been considered in the literature and we provide sequence- 
length requirements. Although the quartet-based method analyzed in Section 3 
can in principle handle arbitrary branch lengths (as only the topology of the gene 
trees is used), here we need to assume that the gene tree branch lengths are
determined by inter-speciation times and lineage-specific rates of substitution.
For simplicity, we assume that there is no gene-specific substitution rate. In
practice, one could incorporate such rates by using a normalization procedure as
detailed in [KS01, GWK05].

5.1 A distance-based approach 

We analyze a distance-based approach similar to that introduced in [KS01] and
studied empirically in [GWK05]. Given branch lengths, a gene tree is naturally
equipped with a tree metric on the leaves
$D_g : L_g \times L_g \to (0, +\infty)$ defined as follows

$$D_g(u, v) = \sum_{e \in P_g(u,v)} \omega_g(e),$$

where $P_g(u, v)$ is the set of edges on the path between $u$ and $v$ in $T_g$
and $\omega_g(e)$ is the length of branch $e$ in $T_g$. We will refer to
$D_g(u, v)$ as the evolutionary distance between $u$ and $v$ under $g$.

For each pair of extant species $\{a, b\}$, we compute the median

$$D_m(a, b) = \mathrm{Median}\{D_{g_i}(a, b) : i = 1, \ldots, N,\ \{a, b\} \subseteq L_{g_i}\}.$$

We abort if a pair is not included in any of the gene trees. We then use the distance 
matrix Dm to build a tree using the Short Quartet Method [ESSW99a] (or any 
other statistically consistent, fast-converging distance-based method). We will 
refer to this method as the MedianTree (MT) method. The algorithm is detailed 
in Figure 6. 

Algorithm MedianTree

Input: $N$ alignments over the taxa $[n]$;

Output: Estimated species phylogeny $\hat{T}$;

• For each gene $g_i$ and each pair of taxa $\{a, b\}$, compute the log-det
distance $\hat{D}_{g_i}(a, b)$.

• For all pairs of taxa $\{a, b\}$, compute

$$\hat{D}_m(a, b) = \mathrm{Median}\{\hat{D}_{g_i}(a, b) : i = 1, \ldots, N,\ \{a, b\} \subseteq L_{g_i}\}.$$

• Using SQM [ESSW99a] on the distance matrix $\{\hat{D}_m(a, b)\}_{a,b \in [n]}$,
compute the tree $\hat{T}$ (or abort if no tree is found).

• Output $\hat{T}$.

Figure 6: Algorithm MedianTree.
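
The median step of MedianTree can be sketched as follows. This is a minimal
illustration, assuming the per-gene distances have already been estimated and
are stored in dictionaries keyed by unordered pairs of taxa; the function name
`median_distances` and this data layout are hypothetical, and the SQM tree
building step is not shown.

```python
from itertools import combinations
from statistics import median


def median_distances(taxa, gene_distances):
    """Pairwise median of per-gene evolutionary distances.

    `gene_distances` is a list with one dict per gene, mapping
    frozenset({a, b}) to the distance between a and b in that gene tree;
    a pair is absent when either taxon is missing from the gene.
    """
    result = {}
    for a, b in combinations(sorted(taxa), 2):
        pair = frozenset((a, b))
        values = [d[pair] for d in gene_distances if pair in d]
        if not values:
            # Mirrors the abort rule: the pair appears in no gene tree.
            raise ValueError(f"pair {a}, {b} is not covered by any gene tree")
        result[pair] = median(values)
    return result
```

The point of using a median is that a minority of LGT-affected genes with
distorted distances does not move the estimate. Note that `statistics.median`
averages the two middle values when the number of genes covering a pair is even.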

Probabilistic analysis Define the maximum path weight (MPW)

$$\Upsilon^{(2)} = \max\{\Lambda_X : X \subseteq L^+, |X| = 2\}.$$

Then:

Lemma 5 (Probability of a miss: Distance case) Let
$T_g = (V_g, E_g, L_g; \omega_g)$ be a gene tree distributed according to the
random LGT model such that $X = \{a, b\} \subseteq L_g$. Let $D_s(a, b)$ be the
evolutionary distance between $a$ and $b$ under the topology of the extant
phylogeny (that is, under the event that no LGT has occurred). Then

$$\mathbb{P}[D_g(a, b) = D_s(a, b) \mid X \subseteq L_g] \geq \exp\left(-\Upsilon^{(2)}\right).$$
Proof (Lemma 5): The proof is similar to that of Lemma 1. ■

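Lemma 6 (Bound on path weight: Bounded-rates case) Under the Bounded-rates
model, it holds that

$$\Upsilon^{(2)} = O(\overline{\lambda} \log n^+).$$

Proof (Lemma 6): Note that

$$\max\{\Lambda_X : X \subseteq L^+, |X| = 2\} \leq 2 \overline{\lambda}\,\overline{\tau} \log_2 n^+.$$

■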

Lemma 7 (Bound on path weight: Yule case) Under the Yule model, it holds
that

$$\Upsilon^{(2)} = \Theta(\overline{\lambda} \log n),$$

with probability approaching 1 as $n \to +\infty$.

Proof (Lemma 7): The proof is similar to that of Lemma 4. ■ 

Proof (Theorems 1 and 3): Using MT and Lemmas 6 and 7, this second proof of
Theorem 1 (and of Theorem 3) follows along the same lines as the quartet-based
proof of Theorem 1. Note however that our extra assumption on the gene tree
branch lengths is needed here to ensure that evolutionary distances are the same
across all genes. ■

5.2 Taking into account sequence length 

We have assumed so far that gene tree topologies and evolutionary distances are 
known perfectly. Of course, this is not the case in practice and the effect of se- 
quence length must be accounted for. One issue that arises is that LGT events 
may create very short branches that are difficult to infer. Nevertheless, we can 
prove the following. We assume that sequence data is generated independently on 
each gene tree according to a GTR model. Evolutionary distances are estimated 
using the log-det distance. See e.g. [SS03] for background on GTR models of 
substitution and the log-det distance. We assume $n^- = 0$ for simplicity.

Theorem 7 (Sequence-length requirements) Under the Bounded-rates and Yule 
models for the species phylogeny and the GTR model for sequences, assuming that 
substitution rates are bounded between constants, a sequence length per gene 
polynomial in n suffices for the MT algorithm to succeed if the number of genes is 
at most polynomial in n. 
Proof (Theorem 7): We only discuss the Yule model. The argument for the
Bounded-rates model is similar.

In our second proof of Theorem 3, we relied on the fact that, w.h.p., for every
pair of taxa, a strict majority of the gene tree evolutionary distances has not
been affected by LGT. Hence, if the worst case estimation error on the
evolutionary distances is $\varepsilon$, then the median of the estimated
distances must be in the interval
$[D_s(a, b) - \varepsilon, D_s(a, b) + \varepsilon]$ for all pairs of taxa
$a, b$. Further, by the concentration bounds in [ESSW99b], for the SQM step of
our MT algorithm to return the correct topology w.h.p., the sequence length must
scale as an exponential of the depth of the tree divided by the square of the
shortest branch length.

Under the Yule model, with probability approaching 1, the depth of the tree is
$O(\log n)$ (by the proof of Lemma 4) and the shortest branch length (the minimum
of $O(n)$ exponentials with mean $O(1)$) is $1/\mathrm{poly}(n)$. Hence the
result follows.^1 ■



6 Discussion 

We have shown that a species phylogeny or network can be reconstructed despite 
high levels of random LGT and we have provided explicit quantitative bounds on 
tolerable rates of LGT. Moreover our analysis sheds light on effective approaches 
for species tree building in the presence of LGT. Several problems remain open: 

• Galtier and Daubin [GD08] hypothesize that random LGT only becomes a
significant hurdle when the rate of LGT greatly exceeds the rate of
diversification. In our setting this would imply that a value of $\Lambda$ as
high as $\Omega(n)$ may be achievable. Note that branches close to the leaves
are particularly easy to reconstruct because they lie on small quartet trees that
are less likely than deep ones to be hit by an LGT event. Is a recursive approach
starting from the leaves possible here? See [Mos04, DMR11] for recursive
approaches in a related context.

• In a related problem, we have analyzed distance-based and quartet-based
methods. A better understanding of bipartition-based approaches is needed
and may lead to a higher threshold for $\Lambda$.

• What can be proved when a model of extinction is incorporated? 

^1 Note that unlike [ESSW99a] we use the inter-speciation times generated by the
continuous-time branching process. In particular their "few logs" result does not
apply to our setting.
• What can be proved when the number of genes is significantly less than
$\log n$?

• In the presence of highways, dealing with more general network settings 
would be desirable. Also our definition of highways as connecting two 
edges is somewhat restrictive. In general, one is also interested in preferen- 
tial genetic transfers between clades. 

• On the practical side, the predictions made here should be further tested
on real and simulated datasets. We note that there is existing work in this
direction [BHR05, GWK05, Gal07, PWK09, PWK10, KPW11, BBGS11].